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ABSTRACT 

Increasing attention has focused on the significance 
of RNA in sperm, in light of its contribution to the 
birth and long-term health of a child, role in sperm 
function and diagnostic potential. As the compos- 
ition of sperm RNA is in flux, assigning specific 
roles to individual RNAs presents a significant chal- 
lenge. For the first time RNA-seq was used to char- 
acterize the population of coding and non-coding 
transcripts in human sperm. Examining RNA repre- 
sentation as a function of multiple methods of 
library preparation revealed unique features indica- 
tive of very specific and stage-dependent matur- 
ation and regulation of sperm RNA, illuminating 
their various transitional roles. Correlation of 
sperm transcript abundance with epigenetic marks 
suggested roles for these elements in the pre- and 
post-fertilization genome. Several classes of 
non-coding RNAs including IncRNAs, CARs, 
pri-miRNAs, novel elements and mRNAs have been 
identified which, based on factors including relative 
abundance, integrity in sperm, available knockout 
data of embryonic effect and presence or absence 
in the unfertilized human oocyte, are likely to be es- 
sential male factors critical to early post-fertilization 
development. The diverse and unique attributes of 
sperm transcripts that were revealed provides the 
first detailed analysis of the biology and anticipated 
clinical significance of spermatozoal RNAs. 

INTRODUCTION 

Sperm structure and function appear fine-tuned towards a 
single purpose — the delivery of the paternal content to the 



oocyte in the most compact and accurate form. Our view 
of the presence and role of RNAs in human sperm has 
dramatically evolved (1). As sperm are generally con- 
sidered transcriptionally inert, RNAs detected in the 
paternal gamete were initially assumed to be either 
content left behind after degradation and expulsion of 
the residual body, or simply contaminants from other sur- 
rounding cells (2,3). However, sperm retain specific coding 
(4-8) and non-coding RNAs (9,10). Irregularities in the 
levels of sperm RNAs have been recognized as markers 
and potential effectors of human male infertility (11-13). 
Their functional role on delivery to the oocyte has been 
suggested (14-16). 

Previous studies have used microarray approaches to 
identify transcripts retained in sperm (12,17,18). 
Although informative, these analyses were unable to 
provide the enhanced view of this suite of RNAs 
afforded by RNA-seq [reviewed in (19)]. Apart from 
being recognized as a more accurate assessment of tran- 
script levels than microarrays (20,21), RNA-seq has the 
added advantage that novel tissue isoforms and variants 
can be identified and quantitatively evaluated. Further, 
previously unrecognized coding and non-coding tran- 
scripts can be discovered and potential function ascribed 
(22-25). The population of sperm RNAs as revealed by 
RNA-seq provides a window into the developmental 
history, functional viability and potential elements de- 
livered by sperm that are likely of significance upon fertil- 
ization. Together with the previous sncRNA analysis (10), 
this study provides a comprehensive snapshot of the tran- 
script composition of human sperm among several indi- 
viduals. Comparison of the many genomic regions 
encoding sperm RNAs with elements of the sperm 
epigenome has provided insight into the function and 
temporal action of these regulatory features. Highlighted 
in the analysis below is the use of exon profiles to identify 
full-length transcripts amongst a larger population of 
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fragmented RNAs. Global analysis of this unique suite of 
paternal transcripts echoes past transcriptional history 
and late-stage spermatogenic function, yet projects its 
role on fertilization. 



MATERIALS AND METHODS 

RNA isolation and sequencing 

To characterize various classes and features of sperm 
RNA, multiple sample preparation methods were used. 
The first set of samples was prepared as described 
(10,26,27) using RNA from a total of six sperm samples 
from three random and three proven fertile donors and 
two testes samples. Briefly, four of these sperm samples 
(henceforth referred to as T-Sperm samples) — 
Sp_D62[Tt], Sp_D64[Tt], Sp_D66[Tt] (individual donors) 
and Sp_Pl[Tt] (mixed pool of three random donors) — 
were not subject to any of the commonly used RNA-se- 
lection methods to fully resolve the RNA population and 
reduce transcript expression profile bias. With the excep- 
tion of the Pooled sample Sp_Pl[Tt], all samples were 
subject to 50% PureSperm (Nidacon, Molndal, Sweden) 
gradient to purify spermatozoa away from other 
contaminating cell- types. RNA from donor D62 was add- 
itionally separated into Poly(A + ) (Sp_D62[A + ]) and 
Poly(A~) (Sp_D62[A~]) fractions by oligo(dT) selection, 
with each fraction sequenced separately. Sperm samples 
were all sequenced in two lanes each on Illumina GAIIx 
Genome Analyzer. Two samples of commercially obtained 
pooled testes RNA — Te_PAm[A + ] (Applied Biosystems/ 
Ambion, Austin, TX, USA, Lot 054P010702031A) and 
Te_PCl[A + ] (ClonTech, Mountain View, CA, USA, Lot 
3090051) — were each sequenced in a single lane after 
Poly(A + ) selection. 

Single Primer Isothermal Amplification (SPIA) was 
used to prepare a second set of RNA-seq libraries from 
20 ng of RNA from each of the three individual donors 
(Sp_D62[SP], Sp_D64[SP], Sp_D66[SP]), a single pooled 
sperm sample (Sp_P2[SP]) and a single pooled testes 
(Te_PAm[SP]) sample using the Nugen Ovation kit 
(Nugen Inc., San Carlos, CA, USA) for cDNA and 
initial linear amplification. The Nugen Encore system 
was used for adaptor ligation, end-repair and amplifica- 
tion. Samples were multiplexed at four samples per lane. 

All samples were subjected to 2 x 36 cycles of 
paired-end sequencing on the Illumina GAIIx platform. 
Paired-end mapping was performed equivalently for all 
samples using Novoalign (V2.07.09, Novocraft 
Technologies, Selangor, Malaysia) with alignment to 
Human Genome build hgl9, and also including ribosomal 
18S and 28S and mitochondrial 12S and 16S indices for 
specific alignment of reads corresponding to these se- 
quences. Reads mapping to multiple genomic locations 
were examined separately. Sequencing data are available 
at the NCBI GEO repository. 

An SPIA-prepared RNA-seq Universal Human 
Reference (UHR; Nugen Ovation library, Nugen Inc.) 
sample composed of RNAs from pooled somatic tissues 
was used for comparison of the parallel SPIA-prepared 
sperm and testes RNAs with a wide representation of 



somatic RNAs. Data from this sequencing are publically 
viewable by including the text 'track type = bigWig 
name = "repl23.uhr.7102" description = "repl23.uhr. 
7102" color = 5 20 198 autoScale = on visibility = 2 
viewLimits = 1:500 bigDataUrl = http://ucscdatahosting. 
com/labuser/Tech_Support/repl23.uhr.7102.bw' in a user 
track window of the UCSC genome browser at http://gen 
ome.ucsc.edu/cgi-bin/hgCustom?hgsid = 322010991 : 

RT-PCR-transcript validation 

Remaining total RNA from Sp_D62[Tt], Sp_D64[Tt] and 
Sp_D66 were reverse-transcribed using SuperScriptHI 
(Invitrogen, Carlsbad, CA, USA) with either oligo(dT) 
(Sp_D62 and Sp_D64 [Tt]) or gene-specific primers 
(Sp_D66 [Tt]) then amplified with HotStarTaq (Qiagen, 
Valencia, CA, USA), as described (28). PCR primer 
pairs were designed to query the entire transcript length 
(Supplementary Table SI). Products were visualized by 
agarose gel electrophoresis and images captured on the 
Typhoon 9210 scanner (Molecular Dynamics). 

Bioinformatic analysis 

Total transcript read counts from each sample were 
measured for 37 973 different isoforms (RefSeq annota- 
tion, HG19) from a total of 22 302 genes. Sperm RNAs 
contain a host of transcripts exhibiting multiple 
polyadenylation sites, intronic elements, or reflect differ- 
ential processing (29). Accordingly to minimize this effect 
and to quantify transcript levels, a custom algorithm was 
developed to count exonic reads from all sequenced RNA 
samples across protein-coding regions only for all 
annotated transcript isoforms (See Supplementary 
Methods). The RPKM (Reads per Kilobase per Million 
reads) values were then calculated for both Total RNA 
and SPIA samples as summarized in Supplementary 
Tables S2 and S3. The aggregate of four T-sperm and 
SPIA-sperm samples was obtained by pooling all 
samples and renormalizing RPKM values based on total 
counts. The most abundant isoform of each gene (RPKM) 
was then used to rank each gene in each individual sample, 
allowing for intersample comparison by Spearman rank 
correlation. Transcript levels in sperm and testes were 
additionally assessed using the Genomatix RegionMiner 
(v2.41207), as this data source also includes a series of 
non-coding transcripts (e.g., RNU) in its assessment of 
transcript abundance. Ontological analysis of selected 
transcript groups was performed using GeneRanker 
(Genomatix v2.41103). Post-fertilization significance of 
sperm-delivered RNAs was examined using expression 
data from microarray analysis of human oocyte at Mil, 
2-, 4-, 6-, 10-cell, morula, blastocyte and stem-cell stage 
(30). Relative differential expression for each transcript 
across these developmental stages was assessed by 
calculating the change in reported expression (log 2) at 
each stage relative to Mil baseline. Predicted targets of 
notable sperm-delivered miRNAs were calculated using 
Diana-microT algorithm (v3.0; 31,32) as described (10). 
High confidence targets (>0.95%) were selected for down- 
stream analysis based on strict predicted binding score 
(miTG >25). 
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An intactness score was developed to assess the stability 
of total sperm RNAs with a length of at least 200 nt. To 
portray the relative level of coverage over transcript 
length, RNA-seq expression profiles for selected tran- 
scripts were obtained by computationally splicing read 
coverage corresponding to the transcript body 
(Supplementary Figure SI). Analysis was based on the 
annotated exon map of the most abundant isoform for 
each sperm transcript. Bias due to lack of read coverage 
at exon ends was minimized by excluding the last 1 5 nt of 
each exon. Normalized coverage profiles were calculated 
across transcripts in 100 bins. The sum of squares of 
actual deviation from the expected coverage of an ideal 
transcript (1% of total coverage/bin) was used as the in- 
tactness score. Transcripts were additionally and separ- 
ately categorized and ordered based on positioning of 
underrepresented or absent regions as a function of the 
relative degree of 5' versus 3' coverage. A linear scaling 
function from the 5'-end maximum positive value to the 
maximum 3'-end negative value was applied to the 
normalized profiles, with subsequent ranked degree of 
variation from a level profile now indicative of the 
degree and site of end degradation. 

To examine the association of RNA abundance in 
sperm with local epigenetic marks, 19 521 expression- 
enhanced promoter and exonic cluster regions from 
sperm sample Sp_D62[Tt] were identified using 
Genomatix RegionMiner tool (Audic-Claverie algorithm 
with P<0.05, window size of 100 bp). External data sets 
were used for correlation analysis of H3K4me3 (34 912 
regions), H3K27me3 (38 337 regions) and histone- 
enriched (25114 regions; 33) and hypomethylated 
regions (79 124 regions; 34). Identified regions as 
reported by authors in hgl8 genome build coordinates 
were converted to hgl9 coordinates using the UCSC 
Genome Browser liftover tool. Minimum distance of 
center point of RNA clusters to closest local element of 
each histone region or hypomethylated DNA region set 
was calculated using Genomelnspector (Genomatix), 
yielding a correlation graph showing the distribution of 
clusters over a ± 10 kb region relative to the center of each 
element. 



RESULTS AND DISCUSSION 

Two separate protocols were used to construct sequencing 
libraries from Total and poly(A) fractionated sperm and 
testes RNAs. Sequencing characteristics for all sample 
libraries are summarized in Supplementary Table S4. 
Reads not aligned (total reads — aligned reads) are primar- 
ily those rejected due to a low Quality Control score. 
Reads that could not be assigned to a unique genomic 
location despite paired-end mapping were considered sep- 
arately in subsequent analyses (below). Although it is 
assumed that sequencing libraries prepared from Total 
RNA can more accurately portray the complete RNA 
profile of a cell, alternative methods of preparation are 
important, as they significantly reduce the contribution 
of the abundant ribosomal and mitochondrial and frag- 
mented RNAs. In this regard, it has been shown that 28 S 



and 18S ribosomal RNAs are essentially fragmented in the 
mature gamete ensuring translational silencing (26,35). 
This precludes their depletion by available technologies 
and hinders efficiently achieving adequate sequencing 
depth. Only those sperm RNAs of subsequent biological 
significance are expected to escape fragmentation. 
Acknowledging how sequencing library construction 
impacts transcript representation was essential in the char- 
acterization of the unique population of RNAs retained in 
sperm. 

Library characteristics 

Both RNA transcript abundance and read profile were to 
some degree dependent on the protocol used for 
sequencing library construction. Average fragment 
length of the SPIA-prepared sequencing libraries was 
~ 140 bp, while T-libraries (Total RNA) averaged 
between 70-90 bp. This increased the representation of 
transcripts of <100nt in this set, enabling the resolution 
of sno and snaR RNAs. As expected, the poly(A + ) sperm 
and testes samples showed a significant decrease in the 
relative abundance of both mitochondrial and ribosomal 
RNAs compared with the total RNA libraries. Although 
SPIA did not use a poly(A + ) selection, the significant de- 
pletion of rRNA achieved by this method results from the 
use of a heterogeneous mixture of primers incapable of 
annealing to the highly structured ribosomal transcripts. 
The SPIA library preparation protocol provided an add- 
itional advantage over other strategies, as it required sub- 
stantially less template RNA. This is of particular concern 
when investigating sperm RNAs (36). Briefly, total RNA 
was reverse transcribed using a proprietary mixture of 
chimeric RNA/DNA primers. Use of these hybrid oligo- 
nucleotides allows for continual priming and polymeriza- 
tion following second strand synthesis (37). 

To assess library purity, a group of 523 transcripts were 
selected that were abundant in SPIA-testes (RPKM > 25, 
approximately 90th percentile), yet were relatively low or 
absent in the SPIA-sperm samples [RPKM 
(Aggregate) < 1 ; Supplementary Table S5]. RNAs that 
were depleted from mature sperm include those associated 
with Sertoli cells (22 genes, P = 1.2 e-04) and epididymis 
(22 genes, P = 4.1 e-06), confirming the absence of 
contaminating RNAs from surrounding cells. Other onto- 
logical groups of note include the absence of the large 
ribosomal subunit (11 genes, P = 1.7 e-06), and 
ribosome (16 genes, P = 1.9 e-04). This is consistent 
with the view that ribosomal fragmentation occurs 
earlier during the maturation phase of spermiogenesis, 
ensuring cessation of translation (26). 

As exhibited by ACSBG2 and shown in Figure 1, the 
majority of T- and SPIA-library prepared transcripts dis- 
played similar sequence-read profiles. However, clear dif- 
ferences were apparent for several transcripts, including 
PRM2 (Supplementary Figure S2), PRM1 and several 
non-coding RNAs. Analyses presented below for sperm 
transcript profiling used the T-libraries. Comparisons 
between sperm and other tissues used equivalently 
prepared SPIA-libraries. 
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Figure 1. The distribution of sequencing reads across ACSBG2 in all libraries. The green arrow indicates transcript orientation. The number of reads 
corresponding to each read start at each base position is represented on the vertical axis. The scale for each sample is based on maximum read count 
within the displayed region. Four Total Sperm samples — Sp_D62[Tt], Sp_D64[Tt], Sp_D66[Tt] (individual donors) and Sp_Pl[Tt] (mixed pool of 
three random donors) — were not subject to any of the commonly used RNA-selection methods. Sample D62 was additionally separated into 
Poly(A + ) (Sp_D62[A + ]) and Poly(A") (Sp_D62[A"]) fractions by oligo(dT) selection. Commercially obtained pooled testes RNA— Te_PAm[A + ] 
(Applied Biosystems/Ambion, Austin, TX, USA, Lot 054P0 10702031 A) and Te_PCl[A + ] (ClonTech, Mountain View, CA, USA, Lot 3090051)— 
were subject to Poly(A + ) selection. Sequences from Single Primer Isothermal Amplification (SPIA — Nugen Ovation Nugen Inc., San Carlos, CA, 
USA) (Sp_D62[SP], Sp_D64[SP], Sp_D66[SP]), a single pooled sperm sample (Sp_P2[SP]) and a single pooled testes (Te_PAm[SP]) prepared libraries 
are compared. 



Comparison of rank transcript abundance between 
samples as measured by RNA-seq yielded an average 
Spearman rank correlation between the four SPIA- 
sperm samples of p = 0.83 (0.76-0.89). The correlation 
between individual SPIA-sperm samples to SPIA-testes 
was p = 0.57 (0.54-0.61), while the correlation of SPIA- 
sperm with a somatic UHR sample was p = 0.38 (0.35- 
0.40) and SPIA-testes with UHR was p = 0.67. These 
values reflect both the unique composition of sperm tran- 
scripts and their stability compared with somatic cells. 
Although the correlation between sperm sample transcript 
rank was acceptable, the absolute measure of abundance 
of many transcripts varied between sperm samples. This 
may reflect the wide-ranging degrees of fragmentation 
and/or the undirected expulsion of many RNAs during 
maturation. 

Multiple-mapped reads 

One issue uniquely encountered in RNA-seq as compared 
with microarray analysis is the treatment of reads 
mapping to multiple genomic locations (38,39). Many 
RNA-seq studies simply ignore these reads, assuming 
that within a coding sequence, 'repetitive' elements are 



of marginal significance. However, there are many genes 
for which identical sequences of the complete or partial 
transcript are annotated in multiple genomic locations. 
Sequence reads derived from these transcripts would 
map equivilently to all these locations. For example, tran- 
scripts like, SPANXB1/B2/F1 appear replicated across 
multiple locations. In comparison, others exhibit limited 
reiteration of sequence spans, e.g., the 3' exon of STK19, 
across multiple locations. Accordingly, if multiple align- 
ment reads were discarded, transcript levels would be 
severely underestimated as highlighted by the sperm- 
associated SPANX transcript (40). Accordingly, all 
multiple-mapped reads were separately retained for tran- 
scripts exceeding RPKM > 2, and data are summarized in 
Supplementary Table S6; sample Sp_D62[Tt]. 

In comparison with the above, transcript levels 
determined by probe hybridization are not affected by 
multiple genomic origins of individual RNAs, as the 
level of each is estimated as an aggregate for that probe. 
For example, when comparing the recent microarray- 
based survey of human round spermatid RNAs (41) 
several of these transcripts are virtually absent from the 
RNA-Seq-processed data sets when multiple-mapped 
reads are ignored (Supplementary Tables S2, S3 and S6). 
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This apparent discordance is resolved when multiple reads 
are correctly assigned, e.g., SPANX genes, histone 
HIST2HAA3/4, CXorf51, RIMBP3/3B/3C and HSFY1/ 
2. This independent validation suggests that other 
multiple-mapped transcripts, e.g., snaRs, require 
consideration. 

Alternative polyadenylation and isoforms 

Alternative polyadenylation (APA) is a transcriptional 
modification whereby the 3'UTR is cleaved at a secondary 
poly(A) recognition site, giving rise to a transcript possess- 
ing an unaltered coding sequence and a significantly 
shortened cell-specific 3' UTR (42-45). EFHD2 RNAs in 
the male reproductive tract offer an example of APA. This 
transcript is represented in all sperm RNA-seq data sets, 
and its coding sequence shown to be intact by RT-PCR 
(Supplementary Figure S3). But unlike that observed in 
testes, the UTR of this RNA is truncated in sperm relative 
to annotated transcripts (Figure 2). Approximately 20% 
of sperm transcripts exhibit clear APA sites relative to the 
full-length y UTR of the primary annotated transcripts. 
Interestingly, more than one-half are sperm-specific and 
ontologically nondescript (Supplementary Table S7). 
This modification has been shown to influence mRNA 
stability and would impact relative abundance (46,47). 

As determined by comparison of the T-libraries, several 
novel isoforms appear to be unique to sperm. These 
include the multi-functional pyruvate kinase PKM2 that 
is closely associated with the regulation of cellular energy 
pathways (48). Of note, conservation of this sperm tran- 
script extends to Arabidopsis thaliana sperm homologue 
AT4G26390 (49) suggesting significance. In human 
sperm, this mRNA possesses two additional exons, 
which are detected at levels comparable with the remain- 
der of the transcript (Figure 2B; boxes). Interestingly, each 
exon has been observed in other isoforms in testis and 
amygdala. Individual mapping of paired-end reads indi- 
cates that both exons are present in novel sperm-specific 
isoforms not observed in testes, the UHR RNAs or in 
unfertilized human oocytes (50). The sperm PKM2 
isoforms also exhibit mutually exclusive use of exons 9 
or 10. The exon 9 isoform is linked with aerobic respir- 
ation oxidative phosphorylation and plays a role in ATP 
production in sperm flagella (51). The M2 exon 10 isoform 
is a member of the aerobic glycolytic pathway often 
observed in growing cells (52). Delivery of these unique 
sperm transcripts may provide a paternal regulator which 
silences OCT 4 maintaining the pluripotent state (53) as 
required post-fertilization. 

Sperm/testes-associated transcripts 

Transcripts that are abundant in both testes and sperm 
(RPKM > 25, SPIA-samples), but are absent or at very 
low levels in the UHR sample (RPKM < 1) were also ap- 
parent. This set of 102 transcripts (Supplementary Table 
S8) comprises those that are present throughout sperm- 
atogenesis and retained through the final stages of matur- 
ation. This group was independently validated in a recent 
testes microarray data set (41). Within this set of 102 
RNA-seq sperm transcripts, 90 transcripts were 



interrogated by microarray probes, of which 42 spermatid 
transcripts were concordant. These RNAs are likely 
critical to gametogenesis, e.g., structural components — 
flagellum (six genes, P = 4.2 e-09), acrosomal vesicle 
(five genes, P = 1.1 e-06) and nucleosome (three genes, 
P = 2.2 e-03). Of particular significance are the large 
number of genes (22 genes, P = 1.6 e-11) in this group 
that are associated with infertility. Mouse knockout data 
from MGI (54) shows that 13 are associated with a male 
factor contribution to infertility (Supplementary Table 
S8). This includes PCYT2, which is only found in testes 
and sperm. It is the primary regulator of the CDP- 
ethanolamine pathway, which when deleted, confers 
complete embryonic lethality (55). One can pose that 
like PCYT2 those that await characterization may simi- 
larly contribute other essential male factors. 

Abundant sperm transcripts 

Of the 22 302 unique transcripts surveyed, 726 displayed 
RPKM levels greater than 50 in the aggregate of T-sperm 
libraries and were thus considered as representative of the 
highly abundant sperm transcript class (Supplementary 
Table S9). Of these transcripts, 565 were identified by 
ontological analysis as linked to testes (P = 6.5 e-48), 54 
were spermatid related (P = 3.2 e-19), 44 were associated 
with spermatogenesis (P = 3.0 e-15) and 76 were linked to 
infertility (P = 7.8 e-09). This is consistent with these 
RNAs providing a critical male function and suggests 
that they may provide a resource of diagnostic markers 
of fitness (56). Enrichment of genes associated with sperm 
motility (six genes, P = 4.6 e-05), acrosomal vesicle (eight 
genes, P = 1.4 e-04), components of the RAN path- 
way (four genes, P = 4.4 e-04) associated with regulation 
of centrosomal activity during mitotic division of 
spermatogonia (57,58), and structural components of 
sperm flagella such as cytoskeleton (84 genes, P = 1.3 
e-06) and microtubule (29 genes, P = 1.9 e-07), clearly 
reflect transcripts coding for critical functions. On one 
hand, it is likely that some abundant sperm transcripts 
including the protamine genes that are required earlier in 
spermatogenesis and RNAs associated with nuclear or- 
ganization (10 genes, P = 5.1 e-08) and the chromosome 
(33 genes, P = 6.3e-04) are no longer required in fully 
differentiated sperm and are simply residual. On the 
other hand, some abundant sperm transcripts may also 
function post-fertilization. For example, RANBP2, one 
of the intact sperm RNAs, is known to play an important 
role in nucleocytoplasmic transport and mitosis, reduction 
of which directly effects aneuploidy (59). This raises 
the intriguing possibility that RANBP2 may be an 
example of a sperm-delivered transcript encoding a 
factor specifically involved in restructuring the paternal 
chromatin following fertilization. Consistent with a 
post-fertilization function, its knockout confers embryonic 
lethality (60,61). 

Sperm-enhanced transcripts 

Sperm transcripts appear to encompass a wide spectrum 
of RNAs that encode proteins spanning past, current and 
future roles. Comparison of the sperm and testes RNAs 
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base position is represented on the vertical axis. The scale for each sample is based on maximum read count within the displayed region. (A) The 
complete annotated 3' UTR of EFHD2 is 1.6 kb in length (pink box). In contrast to testes the 3' UTR of this RNA in sperm is truncated to ~100nt 
due to alternative polyadenylation. (B) An isoform of PKM2 specific to sperm contains exons (pink ovals) not observed together in other tissues. 
Examination of paired reads from these regions indicates splicing between these neighbouring exons and that these regions are transcribed at levels 
comparable with the remainder of the transcript. Previously reported isoforms containing either the Ml or M2 exon are presented below. 



(SPIA-libraries) permits delineation of the transcripts and 
transcript groups that appear significant in sperm by their 
conspicuous abundance relative to testes. As transcription 
is believed to halt in the latter stages of spermiogenesis, the 
majority of RNAs which can be classified as present in 
sperm also appear in testes. Of the 2200 RNAs in the 
top decile of sperm-abundant transcripts, >85% are also 
present in testes at levels that are at least 50% of that 
observed in sperm. Most notable, however, are a 
number of sperm RNAs that appear to increase 
markedly in the latter stages of spermatogenesis, based 
on comparison of relative abundance of relatively intact 
transcripts in parallel sperm and testes libraries. This 
group likely represents a final burst of transcriptional 
activity during spermatid differentiation responsible for 
producing those RNAs with functions specific to the 
final stages of spermiogenesis or function subsequent to 
delivery. For example, among this group are keratins 



KRT33A, KRT5 and JUP, which are necessary structural 
components of sperm cytoskeleton and microtubules 
(62,63). Another RNA in this class is SIX 3, a member 
of the SIX family of transcription factors, which 
includes the newly identified sperm-associated SIX5, 
critical to spermiogenesis (64) that is associated with 
DNA binding (65). Also in this group and identified by 
multiple-mapped reads, is the RNA encoded by the 
Y-chromosome gene HSFY 1/2 that is deleted in some 
cases of severe male infertility (66). Potential post- 
fertilization effector roles also include signalling pathway 
members like SOCS1 that is encoded immediately down- 
stream of the protamine locus (67). It is both a regulator 
of the JAK/STAT signal-transduction pathway (68) and 
acts to block insulin action (69), perhaps regulating early 
embryonic growth. Similarly, NRARP, a coordinator of 
Notch and Lefl -dependent Wnt signalling (70) and 
EGR3, an early growth response regulator of genes 
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controlling biological rhythm, were also identified as 
members of this sperm-enhanced class. 

Non-coding RNAs 

The classes of small non-coding sperm RNAs have been 
previously described (10). Several classes of larger non- 
coding transcripts are also present at significantly higher 
levels within the SPIA-sperm libraries relative to testes. 
This includes members of the small nuclear RNU family 
of RNAs. Together with protein factors, these transcripts 
comprise both the major (nuclear) and minor (cytosolic) 
spliceosomes. In addition to their role as components of 
spliceosome, these elements have also been found to be 
integrally associated with polyadenylation (71). Sperm 
RNU transcripts above the 80th percentile of abundance 
yet essentially absent in SPIA-Testes include RNU11, 
RNU4-U RNU4-2, RNU5A, RNU5E and RNU6ATAC. 
As sperm are considered to be transcriptionally and trans- 
lationally silent, there is presumably no active role for a 
spliceosome in the mature spermatozoon even though as 
independently shown (e.g., RNU11, Supplementary 
Figure S3) they appear intact. The abundance of these 
transcripts in sperm despite the absence of spliceosome 
activity suggests that they may serve a role on delivery 
to the oocyte. In late metaphase II (Mil), prior to fertil- 
ization and until zygotic genome activation, the human 
oocyte is transcriptionally silent (72,73). Coincident with 
embryonic genome activation, dormant ribosomal and 
spliceosomal machinery must be reactivated (74). 
Perhaps the sperm RNU transcripts activate dormant 
zygotic post-transcriptional RNA-processing pathways. 

Several snaR (small NF90-associated RNAs) tran- 
scripts including snaR-C3, -C4, -E, -F, -Gl and -/ are 
also abundant in sperm compared with testes. These 
novel small non-coding RNAs are unique to human and 
chimpanzee and are embedded in genomic regions that 
contain many genes important to reproduction, develop- 
ment and regulation of male fertility (75). Of particular 
note is the observation that all snaR elements also overlap 
hypomethylated regions (HMRs) in sperm. These regions 
(34) occupy ~4% of the total genome, and transcripts in 
the vicinity of these regions are abundant in sperm (dis- 
cussed below). The observation that these regions overlap 
snaR elements suggests that local chromatin is specifically 
organized with respect to these transcripts. Even though 
their precise role remains to be determined, the following 
two observations require consideration. First, snaR-Gl is 
both intact as shown in Supplementary Figure S3, and one 
of the most abundant RNAs within the sperm-enhanced 
class. Second, the sequence encoding this RNA directly 
overlaps the predicted promoter region for the beta 1 
subunit (CGB1) of chorionic gonadotropin glycoprotein 
hormone (CG/3), a regulator of early placental develop- 
ment and implantation (76). 

A number of other novel classes of ncRNAs (non- 
coding) were additionally noted to be uniquely abundant 
in sperm as compared with testes and observed at rela- 
tively similar levels in all sperm samples examined. 
Included among these are at least 250 RNAs that appear 
to be a specific intron of transcripts expressed in testes, but 



for which the spliced mature form is no longer present in 
sperm (Supplementary Table SI 2). Recent studies have 
suggested that intron-encoded RNAs may function as pre- 
cursors of miRNAs or as regulatory factors (77,78). Some 
sperm RNAs appear as relatively short regions 
overlapping the UTRs or individual exons of several low 
abundance sperm RNAs (Supplementary Table SI 3). 
Local sequence analysis suggests that these regions are 
expressed independently of the transcripts in which they 
are located and that some may possess a poly(A) consen- 
sus site (AATAAA) indicative of antisense transcription 
(79-81). Sense-antisense regulation is a pervasive feature 
of transcriptomes and contributes to gene silencing, select- 
ive transcript editing, promoter inactivation and epigen- 
etic modification (82-86). Previous studies in other tissues 
have shown that such transcripts are often independently 
regulated and only co-expressed with a corresponding 
sense transcript target at specific stages of development. 
Additionally, many RNAs present in sperm overlap 
regions identified by Gencode annotation (preV13) as 
either IncRNAs or as antisense elements (Supplementary 
Table S14) or isoforms. A final class of RNAs for which 
there is evidence in sperm is that of chromatin-associated 
RNAs (CARs). These RNAs are transcribed from 
intergenic or intronic regions of the genome, are found 
to bind to DNA and may play a role in regulation of 
DNA architecture or transcriptional regulation (87). 
Several previously identified CAR regions of the genome 
are transcribed at significant levels in sperm 
(Supplementary Table SI 5). Further work is being done 
to more fully characterize and assess the functional im- 
portance of each of these ncRNA classes in sperm. 

Pri-mir-RNAs 

In addition to the previously characterized population of 
mature sperm micro-RNAs (10), the paternal gamete also 
harbours a number of ~200-300nt micro-RNA precur- 
sors (pri-miRNAs) that are essentially absent in testes. 
Examples of pri-mirRNAs present at high levels in 
sperm include pri-mir-1181, -miR-3648, -miR-3687, 
-mir-663 and -mir-181c. The most abundant of these, 
pri-mir-181c, is not observed in its shorter mature form 
in sperm (10) nor in either form in testes 
(Supplementary Figure S4). This strongly suggests that 
this transcript is delivered to the oocyte as a precursor 
mir-RNA requiring post-fertilization activation. The 
ability of fertilized mouse oocytes to process sperm-borne 
pri-mi-RNAs into mature forms capable of modulating 
target RNAs has been demonstrated (16). To evaluate 
the potential post-fertilization role of mir-181c, 27 
high-confidence targets were computationally identified. 
Several of the putative targets including FIGN, 
ONECUT2, ZNF197, POU2F1, ROD1, ACRV2A, 
ACRV2B, TCERG1 and FOXP1 are associated with dif- 
ferentiation (77), while TNRC6B is linked to the regula- 
tion of miRNA pathway (78). Their average relative level 
as a function of the 2-, 4-, 6-, 10-post-fertilized human 
oocyte (30) is illustrated in Figure 3 and detailed in 
Supplementary Table S10. Greater than 70% of the 
targets decreased in expression from the oocyte to 
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Embryo Stage 
miR-181c Targets ALL gene probes (32,321) 

Figure 3. Average level of miR-181C targets as a function of 
preimplantation embryonic development. The average level of 27 
high-confidence predicted miR-181C targets through preimplantation 
embryonic development OO: Oocyte, M:Morula, BL:blastocyst, 
ST: Stem cell is shown in blue. This is compared to the relative level 
of non-targeted genes indicated in red. 



morula stage, with nine of these decreasing by a factor of 2 
or more. Downregulation of targets from oocyte to 
morula stage is significantly greater than all genes 
(P < 0.0001) compared with the average expression of all 
32 321 gene probes interrogated. This is consistent with 
the view that the sperm-delivered pri-miR-181c is pro- 
cessed to regulate genes of developmental consequence 
following delivery to the oocyte. Recent work has 
demonstrated that mature miR-181c modulates the ex- 
pression levels of CARM1, a key regulator of critical 
pluripotency factors in human and mouse embryonic 
stem cells and blastomeres (79-81). 

Sperm RNA integrity 

A computational approach was developed to globally 
identify the population of intact RNAs amongst the 
various fragmented forms. The RNA-seq profiles from 
the 1000 most abundant T-sperm RNAs (Sample 
Sp_D62[Tt]) were used to establish a means of measuring 
individual transcript integrity. This analysis was 
predicated on the assumption that sequencing coverage 
would vary less across the length of an intact transcript 
relative to one which was fragmented. To determine the 
uniformity of coverage for each, RNA transcripts were 
divided into 100 bins, and a 5-bin moving average was 
used to calculate localized variations in sequencing 
coverage. The squared deviation from expected coverage 
for each bin was summed and used as an intactness score 
to rank the 1000 RNAs according to their stability 
(Figure 4A). This highlighted a subpopulation of intact 
RNAs (top, yellow) while accentuating the biased repre- 
sentation of the 5'-end of the transcripts and variable 
depth of sequencing coverage (bottom left, red). It is im- 
portant to note that shortened 3' UTRs due to APA 
(see Alternative Polyadenylation and Isoforms) may 
contribute to decreased 3' representation of some tran- 
scripts. Three mRNAs of interest were selected from the 
top 200 most intact profiles to test in silico predictions of 
stability by RT-PCR using transcript spanning primers 



(Supplementary Table SI and Figure S3). These included 
NDUFA13, IZUM04 and CIB1, which respectively 
encode a component of complex I of the mitochondrial 
respiratory chain, a protein involved in sperm-egg fusion 
and a calcium-binding protein essential to spermatogen- 
esis (82-84). As described, three non-coding RNAs, 
snaR-Gl, RNU11 and lnc-ERGICl-1, which were not con- 
sidered in the above integrity analysis due to their reduced 
length were also evaluated. As shown in Supplementary 
Figure S3, full length cDNA products corresponding to 
these transcripts were detected in all samples. The 
observed differential fragmentation and preservation of 
select RNAs may reflect the diverse roles of these tran- 
scripts during spermiogenesis or following fertilization. To 
determine if RNA stability was functionally correlated, 
the ranked transcripts were divided into quintiles and sub- 
jected to ontological analysis (1 most intact, 5 least intact; 
Supplementary Table SI 1). Transcripts in the first quintile 
were associated with mature spermatids (24 genes, P = 2.8 
e-16) and male infertility (26 genes, P = 3.8 e-8). 
Subsequent quintiles generally highlighted ontological 
classes that included posttranscriptional regulation of 
gene expression (12 genes, P = 4.9 e-6) and RNA 
binding (20 genes, P = 9.4 e-5). The number of RNAs 
found within mature spermatids and male infertility 
ontologies decreased with each subsequent quintile, with 
not one spermatid or infertility transcript being observed 
in the 5th and most deteriorated. 

A scaling function was then developed to highlight dir- 
ectionality and degree of fragmentation. As shown in 
Figure 4B, the average sequencing coverage profile for 
each ranked and ordered decile highlighted three general 
categories of RNA stability: flat/intact (green; deciles 1-3), 
5' truncated (orange; decile 4) and 3' truncated (purple; 
deciles 5-10). The presence of a pronounced curve at 
either end of a profile was observed for those sets of 
RNAs with a low intactness score. Interestingly, this 
analysis uncovered a group of 5' truncated RNAs that 
were not readily observed in Figure 4A. These 
compromised sets were only weakly, if at all, associated 
with the ontological categories shared between the flat 
intact RNA groups. Though it is unlikely that truncated 
sperm RNAs serve any functional role following fertiliza- 
tion, some may serve as clinical biomarkers of successful 
spermatogenesis alongside the subpopulation of 
full-length transcripts identified above. 

In contrast, the profiles that exhibited reduced variance 
in Relative Sequence Coverage were comprised mainly of 
transcripts with a high intactness score in the prior 
ranking and were enriched in similar ontological classes 
as the top intact quintile. These RNAs appear preserved 
and therefore are the most likely group to contain candi- 
dates for function following fertilization. This set of intact 
transcripts likely remains incomplete, reflecting the com- 
putational and technical (26) challenge of identifying 
full-length RNAs amongst a larger pool of transcriptional 
remnants. However, included in this preliminary set of flat 
and intact sperm RNAs is INTS1. This transcript encodes 
a subunit of the integrator complex that is required to 
process components of spliceosome. Instl knockout mice 
arrest at the early blastocyst stage and exhibit a 
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Figure 4. The most abundant sperm RNAs exhibit differential levels of stability. (A) Normalized RNA-seq coverage across the coding regions and 
UTRs of the 1000 most abundant sperm transcripts at a 100-bin resolution was used to identify intact RNAs. Transcripts are ranked in descending 
order of intactness and were divided into quintiles prior to ontological analysis. Profiles corresponding to stable RNAs exhibit uniform se- 
quence coverage across the transcript (top, yellow). Profiles exhibiting increasing heterogeneity of coverage (bottom, red) are indicative of truncation. 
(B) A scaling function was applied to the above integrity scores to rank transcripts by their directionality and degree of fragmentation. The average 
sequencing coverage profiles for decile bins are presented. The expected level of coverage for an ideal transcript is shown as a dashed line. 
To highlight their likely functional importance, the flattest transcript deciles are prioritized to the foreground. 
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concomitant accumulation and reduction of precursor and 
mature U2 snRNA, respectively (85). The level of this 
transcript increases (fold-change 1.5) in human two-cell 
embryos relative to the unfertilized oocyte, suggesting a 
paternal contribution (30). 



Epigenetic modification of sperm transcript-associated 
genes 

During spermatogenesis, the majority of the nucleosomes 
are replaced with protamines, compacting chromatin to a 
near crystalline state (86,87). This transition to a 
protamine-bound genome is evident from the significant 
underrepresentation in sperm of the >100 transcripts 
encoding both canonical and variant histones present in 
testes and/or somatic cells. Confirming previous results, 
some histone transcripts, including HIST1H1T, 
HIST1H2AA, HIST1H2BA (TSH2B) and H1FNT, 
revealed by RNA-seq are more abundant in testes than 
in somatic cells (UHR; 88). Additionally, histone 
variants H1FNT, H2AFJ, H3F3C and HILS1 are 
retained at a fairly high level in mature sperm. While 



H1FNT (H1T2) plays a role in spermatid elongation 
and DNA condensation (89,90) and HILS1 is similarly 
noted to aid in chromatin remodelling during spermio- 
genesis (91), the specific function of the other sperm- 
retained histone RNAs is not yet known. They may 
represent the last histones used prior to replacement 
by the transition proteins and protamines, perhaps 
facilitating chromatin compaction or essential for 
paternal imprinting. 

A comparatively small fraction of the paternal genome 
(5-15%) escapes protamination (92). Perhaps these 
regions encode transcripts required during compaction, 
encompass regions important early in development, or 
ensure protamine replacement with histones upon fertil- 
ization (8,93,94). These nucleosome-bound regions may be 
further differentiated based on presence or absence of 
specific histone modifications, DNA methylation and 
other genomic marks. The relationship between retained 
or specifically modified histones as a function of the 
associated sperm RNAs is illustrated in Figure 5. Sperm 
RNAs are associated with HMRs, H3K4me3 marks and 
histone-retained regions but not H3K27me3 enrichment. 
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Figure 5. Sperm RNAs are correlated with the genomic positioning of histones and epigenetic elements. A total of 19 521 highly expressed RNA-seq 
clusters were identified in the promoter and exonic regions of sperm sample Sp_D62[Tt]. Each distance correlation is centered on the genomic 
coordinates of elements within one of the four following classes, H3K4me3 and H3K27me3 enriched regions (n = 34912 and 38 337, respectively; 95), 
histone-enriched regions (n = 25 114; 33) and hypomethylated DNA regions (n = 79 124; 34). 
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The high level of H3K4me3 observed in sperm in associ- 
ation with developmental factors EVX1, EVX2, ID1, 
STAT3, KLF5, FGF9, SOX7 and SOX9 has previously 
been suggested to mark sites poised for transcription fol- 
lowing fertilization (33). Although these genes are marked 
by histone modifications correlated with active transcrip- 
tion when delivered to the oocyte (95-97), the coincident 
occurrence of their corresponding transcripts in sperm 
must be considered. Perhaps the H3K4me3 mark 
correlated with the high levels of EVX2, KLF5, SOX9 
and EVX1 RNAs retained by the mature male gamete 
represents a transcriptional ghost. In contrast, the devel- 
opmental regulatory HOX genes are not expressed in 
spermatogenesis and yet they are hypomethylated and 
enriched in nucleosomes, some of which contain the 
active H3K4me3 and repressive H3K27me2 modified 
histones (Supplementary Figure S4A and C; 33). 
Accordingly, the transcriptionally silent yet epigenetically 
poised bipotential state of the HOX clusters in sperm 
infers that these genes are packaged for early embryonic 
use. These domains also harbour EVX1 and EVX2, two 
regulators of early embryonic development, which are 
evolutionarily related to the HOX genes (98,99). 
Examination within the vicinity of EVX1 and EVX2 
revealed significant retention of an RNA encoded by the 
last two exons of EVX2 in addition to transcripts corres- 
ponding to sequences proximal to the 3 / -ends of both 
genes (Supplementary Figure S5B and D). Reads from 
these regions appear unique to sperm compared with 
testes, and may be indicative of novel non-coding, poten- 
tially regulatory transcripts. 

The population of RNA in sperm as revealed by 
RNA-seq provides a window into the developmental 
history, functional viability and potential elements 
present by sperm that may serve a role in the final 
stages of spermiogenesis or at fertilization. These 
include a number of coding and non-coding RNAs, the 
primary role of which in either sperm or the oocyte can 
now be delineated. Correlation of RNA abundance in 
sperm with epigenetic marks has permitted a more 
complete picture of how these marks may act to 
regulate both the pre- and post-fertilization genome. 
The observation that an experimentally derived mouse 
gynogenote may not strictly require a paternal contribu- 
tion (100) remains to be reconciled. It is likely that in this 
case, the confrontation-consolidation pathway [reviewed 
in (101)], in which the paternal RNAs may participate in 
genome compatibility recognition, was sidestepped. 
Further, thought must be given to the documented role 
of the paternally derived mir-34c mediating first cell 
division (102), the most abundant human sperm micro 
RNA (10) or the function of other RNAs as paternal 
epigenetic modifiers [reviewed in (103)]. The identifica- 
tion of transcripts based on traits including relative 
abundance, integrity, knock-out phenotypes and 
presence or absence in the unfertilized oocyte, provides 
the essential foundation to elucidate the role of male 
factors critically important to the fertilization, birth 
and long-term health of a child. 
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